Scientific Annals of Computer Science vol. 33 (1), 2023, pp. 35-52 
doi: 10.7561/SACS.2023.1.35 


Shrinkage Estimators for the Intercept 
in Linear and Uplift Regression 


Szymon Jaroszewicz¹,² and Krzysztof Rudas¹,²


Abstract 


Shrinkage estimators modify classical statistical estimators by scal- 
ing them towards zero in order to decrease their prediction error. We 
propose shrinkage estimators for linear regression models which explic- 
itly take into account the presence of the intercept term, shrinking it 
independently from other coefficients. This is different from current 
shrinkage estimators, which treat the intercept just as an ordinary re- 
gression coefficient. We demonstrate that the proposed approach brings 
systematic improvements in prediction accuracy if the true intercept 
term differs in magnitude from other coefficients, which is often the case 
in practice. We then generalize the approach to uplift regression which 
aims to predict the causal effect of a specific action on an individual 
with given characteristics. In this case the proposed estimators improve 
prediction accuracy over previously proposed shrinkage estimators and 
achieve impressive performance gains over original models. 


1 Introduction 


Linear regression is arguably the most important type of statistical model. 
The most frequently used estimator for the model is the so-called Ordinary
Least Squares (OLS) estimator based on minimizing the squared error [2]. 
The method is very well understood theoretically and offers good predictive 
accuracy in many practical cases. 


This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 
International License 

¹ Institute of Computer Science, Polish Academy of Sciences

² Faculty of Mathematics and Information Science, Warsaw University of Technology




However, the OLS estimator is rarely optimal, and several methods
of lowering its predictive error have been developed, primarily by reducing
the estimator’s variance at the expense of introducing bias [2]. The most
popular class of such methods is based on regularization. Another, somewhat
less frequently used, class consists of shrinkage estimators, which
scale the Ordinary Least Squares estimate by a factor $\alpha < 1$. While less
popular, shrinkage estimators offer several advantages over regularization-based
methods; for example, there is no need to select regularization parameters
through cross-validation. The best known such estimator is the
James-Stein estimator [5], which has a provably lower expected mean squared
error than the classical OLS estimator, even though the shrinkage factor $\alpha$ is
calculated based on the training data. Another choice is a class of shrinkage
estimators based on minimizing predictive MSE directly and substituting
sample estimates for unknown parameters [7].

In this work we analyze ways to shrink OLS estimators for models which 
involve the intercept term. The intercept coefficient is estimated somewhat 
differently from other coefficients and often takes values very different from 
them, so it seems reasonable to apply a different shrinkage factor to it. 
Current shrinkage estimators typically ignore the different nature of the 
intercept and use the same shrinkage factor as for the remaining coefficients. 
Another strategy (also used in regularization-based estimators) is to keep the
original OLS estimate for the intercept term and only shrink the remaining
coefficients. Here, we propose the use of a separate shrinkage factor for the
intercept term of the OLS estimator and demonstrate the benefits of such
an approach.


1.1 Uplift Modeling 


Another estimation problem discussed in this paper is the development of
shrinkage estimators for uplift regression. The aim of uplift modeling is to
estimate the causal effect of an action, such as a medical treatment or a
marketing campaign, on a given individual [8, 11].

To clarify the nature of the problem, let us give an example. Consider
an online shop which, in order to increase sales, offers discounts to selected
customers. Some of the customers were not going to buy, but the discount
changed their minds. Clearly, this results in increased revenue for the shop.
Other customers were going to buy from the shop anyway, and used
the discount simply to spend less money. Putting issues of customer loyalty
aside, for these customers the discount resulted in a loss of income for the
store. Of course, only
the first group is of interest to shop owners, but classical regression analysis
is not able to distinguish such customers.

In fact, the proper way to select targets for an action is to consider 
the difference between the response in case an individual is subjected to the 
action (treated) and the response when the individual is not subjected to 
the action (control). Unfortunately, these two pieces of information are never
known to us simultaneously: once we send the discount, we cannot make the
customer forget about it. This is known in the literature as the Fundamental
Problem of Causal Inference [3].

Uplift modeling is a solution to this problem based on dividing the 
available training sample into two parts: the treatment group, subjected to 
the action, and the control group on which the action is not taken. This 
second group is used as a background against which the true benefit of 
taking the action can be assessed. In uplift regression our aim will be to 
estimate the difference between treatment and control responses for an object 
described by a feature vector x [9]. More details on the problem are given 
in Section 4. 

The second contribution of this work is the development of new shrinkage
estimators for uplift regression. The first such estimators were proposed
in [10], showing clear practical benefits. However, those methods shrank the
intercept term using the same factor as the remaining coefficients. Here we
develop estimators which use separate shrinkage factors for the intercept
term and demonstrate experimentally that they improve the prediction
accuracy of uplift regression.


1.2 Notation


Let us now introduce the notation used throughout the paper. Lowercase
Greek and Latin letters, e.g. $\alpha$, $\beta$, $x$, $y$, will denote vectors, and uppercase
letters, e.g. $X$, will denote matrices. Matrix transpose will be denoted with
the prime symbol $'$, matrix trace with $\operatorname{Tr}$, and $0_{n\times m}$ and $1_{n\times m}$ will be used
to denote matrices of, respectively, all zeros and all ones with $n$ rows and $m$
columns. $I_n$ will denote the $n \times n$ identity matrix.

Vector and matrix random variables will be denoted, respectively, with
lower- and uppercase boldface letters, e.g. $\mathbf{y}$, $\mathbf{X}$. Scalar random variables will
also be denoted with boldface lowercase letters. Statistical estimators will
be denoted with the usual hat symbol $\hat{\ }$ above a variable name. Even though
the estimators are random variables, boldface will not be used for them to avoid
notational clutter. $\mathrm{E}$ will denote the expectation of a random variable and
$\operatorname{Var}$ the covariance matrix of a random vector. Quantities related to test
data will be denoted with subscript $t$.
Notation specific to uplift modeling will be introduced in Section 4.


2 Shrinkage Estimator for the Intercept Term in 
Ordinary Least Squares 


We begin by describing the classical Ordinary Least Squares (OLS) regression
methodology. Only facts needed to understand the remaining part of the
paper are given; a full exposition can be found e.g. in [2]. We assume that we
have a random vector $\mathbf{x} \in \mathbb{R}^p$ of predictor variables, a real-valued response
variable $\mathbf{y}$, and that there exists a fixed joint distribution $P$ over $\mathbf{x}$ and $\mathbf{y}$.

Further we assume that we have a training set which consists of $n$ samples
from the joint distribution arranged in an $n \times (p+1)$ matrix $\mathbf{X}$ of predictors
and an $n$-dimensional vector $\mathbf{y}$ of responses. We assume that the first column
of $\mathbf{X}$ is a vector of ones and the remaining columns correspond to the $p$
predictor variables. The column of ones will allow for easy treatment of the
intercept term.

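Further, we assume that $\mathbf{y}$ is related to $\mathbf{X}$ through a linear equation


$$\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon}, \tag{1}$$


where $\beta$ is an unknown coefficient vector and $\boldsymbol{\varepsilon}$ is a random noise vector
satisfying the usual assumptions that $\mathrm{E}\,\boldsymbol{\varepsilon} = 0_{n\times 1}$, $\operatorname{Var}\boldsymbol{\varepsilon} = \sigma^2 I_n$, and the
components of $\boldsymbol{\varepsilon}$ are independent of each other and independent from $\mathbf{X}$ [2].

The first component of $\beta$, i.e. $\beta_0$, is called the intercept term and is
responsible for the constant offset of $\mathbf{y}$.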

Notice that we consider the training set $\mathbf{X}$, $\mathbf{y}$ to be random. This point
of view will be used in our analyses. In practice we only have one realization
of $\mathbf{X}$, $\mathbf{y}$ available, which we will denote with non-bold letters $X$, $y$.

Our goal is to find an estimator $\hat\beta$ of $\beta$ which, on a new test sample $(\mathbf{x}_t, \mathbf{y}_t)$
drawn from the distribution $P$ (for notational simplicity we assume that $\mathbf{x}_t$
is augmented with a constant 1 in the first coordinate), achieves
the lowest possible predictive mean squared error


$$\operatorname{MSE}(\hat\beta) = \mathrm{E}_{\mathbf{X}} \mathrm{E}_{\mathbf{y}} \mathrm{E}_{\mathbf{x}_t} \mathrm{E}_{\mathbf{y}_t} (\mathbf{y}_t - \mathbf{x}_t'\hat\beta)^2, \tag{2}$$


where the expectation is taken over the test sample as well as over the
training set used to obtain $\hat\beta$. The most popular estimator is the Ordinary
Least Squares (OLS) estimator obtained by minimizing the training set
squared error $\|\mathbf{y} - \mathbf{X}\beta\|^2$. The estimator is given by the equation


$$\hat\beta_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{3}$$


It is well known that $\hat\beta_{\mathrm{OLS}}$ is unbiased, i.e. $\mathrm{E}_{\mathbf{X}} \mathrm{E}_{\mathbf{y}} \hat\beta_{\mathrm{OLS}} = \beta$ [2].
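
As an illustration of Equation 3, the following is a minimal NumPy sketch on synthetic data. The function name `ols`, the random seed and the data sizes are assumptions made for this example only; a least-squares solver is used instead of forming the inverse explicitly, which gives the same estimate whenever the design matrix has full column rank.

```python
import numpy as np

def ols(X, y):
    """OLS estimate (X'X)^{-1} X'y; X must already contain the column of ones."""
    # lstsq avoids forming the inverse explicitly and is numerically safer
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Illustrative synthetic data (sizes and coefficients are arbitrary choices)
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # first column: ones
beta = np.array([10.0, 1.0, 0.5, 1.0])                          # intercept first
y = X @ beta + rng.standard_normal(n)                           # noise with sigma = 1

beta_ols = ols(X, y)
```

The first entry of `beta_ols` is the estimated intercept; the remaining entries correspond to the $p$ predictor columns.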

We now move on to derive the new proposed shrinkage estimators. We 
will begin by using a very general form of a shrinkage estimator based on 
Ordinary Least Squares: 


$$\hat\beta_\alpha = \alpha \odot \hat\beta_{\mathrm{OLS}} = \alpha \odot (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{4}$$


where $\alpha \in \mathbb{R}^{p+1}$ is a vector of nonnegative shrinkage coefficients and $\odot$ is
the Hadamard (elementwise) product of vectors and matrices. Notice that
in this setting every regression coefficient $\beta_i$ may be shrunk separately with
a different shrinkage factor $\alpha_i$. We begin by analyzing this general shrinkage
estimator and later present several restrictions to it. In the most important
one, all coefficients will have equal shrinkage factors, except for the intercept
which will be shrunk using a separate coefficient.

Let us now calculate the predictive MSE (Equation 2) of the general 
shrinkage estimator given in Equation 4. Let us first present a simple 
relationship between the Hadamard product of vectors and diagonal matrices. 
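For any two vectors $\alpha$ and $\beta$ we have


$$\alpha \odot \beta = \operatorname{diag}(\alpha)\beta = \operatorname{diag}(\beta)\alpha, \tag{5}$$


where $\operatorname{diag}(\alpha)$ is a diagonal matrix with main diagonal equal to $\alpha$. The
expectation of the shrinkage estimator $\hat\beta_\alpha$ can thus be written as


$$\mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,\hat\beta_\alpha = \operatorname{diag}(\alpha)\,\mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,\hat\beta_{\mathrm{OLS}} = \operatorname{diag}(\alpha)\beta = \alpha \odot \beta, \tag{6}$$


where $\beta$ is the true coefficient vector (see Equation 1), and the second
equality comes from the fact that $\hat\beta_{\mathrm{OLS}}$ is unbiased. Define $S_t = \mathrm{E}_{\mathbf{x}_t} \mathbf{x}_t\mathbf{x}_t'$,
and let $\boldsymbol{\varepsilon}_t$ denote the random noise term on the test sample. We now have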


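\begin{align}
\operatorname{MSE}(\hat\beta_\alpha)
&= \mathrm{E}_{\mathbf{x}_t}\mathrm{E}_{\mathbf{y}_t}\mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,(\mathbf{y}_t - \mathbf{x}_t'\hat\beta_\alpha)'(\mathbf{y}_t - \mathbf{x}_t'\hat\beta_\alpha) \tag{7} \\
&= \mathrm{E}_{\mathbf{x}_t}\mathrm{E}_{\boldsymbol{\varepsilon}_t}\mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,(\mathbf{x}_t'\beta + \boldsymbol{\varepsilon}_t - \mathbf{x}_t'\hat\beta_\alpha)'(\mathbf{x}_t'\beta + \boldsymbol{\varepsilon}_t - \mathbf{x}_t'\hat\beta_\alpha) \tag{8} \\
&= \mathrm{E}_{\mathbf{x}_t}\mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,(\beta - \hat\beta_\alpha)'\mathbf{x}_t\mathbf{x}_t'(\beta - \hat\beta_\alpha) + \mathrm{E}_{\boldsymbol{\varepsilon}_t}\boldsymbol{\varepsilon}_t^2 \tag{9} \\
&= \mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\,(\beta - \hat\beta_\alpha)'S_t(\beta - \hat\beta_\alpha) + \sigma^2 \tag{10} \\
&= (\beta - \mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\hat\beta_\alpha)'S_t(\beta - \mathrm{E}_{\mathbf{X}}\mathrm{E}_{\mathbf{y}}\hat\beta_\alpha) + \operatorname{Tr}(S_t \operatorname{Var}(\hat\beta_\alpha)) + \sigma^2 \tag{11} \\
&= (\beta - \alpha\odot\beta)'S_t(\beta - \alpha\odot\beta) + \operatorname{Tr}(S_t \operatorname{diag}'(\alpha)(\operatorname{Var}\hat\beta_{\mathrm{OLS}})\operatorname{diag}(\alpha)) + \sigma^2 \tag{12} \\
&= (\beta - \alpha\odot\beta)'S_t(\beta - \alpha\odot\beta) + \alpha'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})\alpha + \sigma^2, \tag{13}
\end{align}


where Equation 8 follows from the linear model assumptions given in Equation 1,
Equation 9 from the independence of $\boldsymbol{\varepsilon}_t$ from all other variables, and
Equation 10 follows from the independence of $\mathbf{x}_t$ from $\mathbf{X}$ and $\mathbf{y}$. Equation 11
follows from the bias-variance decomposition [2, 10] and the properties of
the trace of a matrix, Equation 12 by substituting the expectation of $\hat\beta_\alpha$
given in Equation 6, and Equation 13 from the properties of the Hadamard
product, specifically [4, Lemma 5.1.5].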
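
The step from Equation 12 to Equation 13 can be checked numerically. The sketch below is an illustration only: the function name `shrinkage_mse` and the randomly generated positive semidefinite matrices standing in for $S_t$ and $\operatorname{Var}\hat\beta_{\mathrm{OLS}}$ are assumptions of this example, not quantities from the paper.

```python
import numpy as np

def shrinkage_mse(alpha, beta, S_t, var_ols, sigma2):
    """Predictive MSE of alpha ⊙ beta_OLS, written as in Equation 13."""
    bias = beta - alpha * beta                      # beta - alpha ⊙ beta
    return bias @ S_t @ bias + alpha @ (S_t * var_ols) @ alpha + sigma2

rng = np.random.default_rng(1)
k = 4                                               # plays the role of p + 1
A = rng.standard_normal((k, k))
C = rng.standard_normal((k, k))
S_t = A @ A.T                                       # symmetric PSD stand-in for S_t
var_ols = C @ C.T                                   # symmetric PSD stand-in for Var(beta_OLS)
alpha = rng.uniform(0.0, 1.0, k)
beta = rng.standard_normal(k)

# Variance term of Equation 12 (trace form) and of Equation 13 (Hadamard form)
D = np.diag(alpha)
trace_form = np.trace(S_t @ D @ var_ols @ D)
hadamard_form = alpha @ (S_t * var_ols) @ alpha
```

The two variance terms agree because the covariance matrix is symmetric. Setting all entries of `alpha` to 1 removes the bias term and leaves the unshrunk variance term $\operatorname{Tr}(S_t \operatorname{Var}\hat\beta_{\mathrm{OLS}})$ plus $\sigma^2$.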

The shrinkage coefficient vector $\alpha$ will be chosen by minimizing the
predictive mean squared error given in Equation 13. First, however, we
need to provide a way to obtain concrete estimators from the very general
Equation 4.

Equation 4 gives a form of shrinkage estimators with every regression 
coefficient having a separate shrinkage factor. In practice we want several 
coefficients to share a single shrinkage factor. To make this possible, we will 
assume that the shrinkage coefficient vector has the form 


$$\alpha = B\gamma, \tag{14}$$


where $\gamma$ is a vector of $q \le p+1$ unique shrinkage coefficients and $B$ is a
$(p+1) \times q$ matrix. For example, to obtain an estimator in which a single
shrinkage factor is shared by all coefficients except the intercept term, which
has its own shrinkage factor, we can use


$$\alpha = B\gamma = \begin{bmatrix} 1_{1\times 1} & 0_{1\times 1} \\ 0_{p\times 1} & 1_{p\times 1} \end{bmatrix} \begin{bmatrix} \gamma_0 \\ \gamma_1 \end{bmatrix},$$


where $0_{n\times m}$ and $1_{n\times m}$ denote respectively the matrices of all zeros and all
ones. Now, $\gamma_0$ is the shrinkage factor for the intercept term and $\gamma_1$ the
common shrinkage factor for the remaining coefficients.

Formula 4 now becomes 


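$$\hat\beta_\alpha = \alpha \odot \hat\beta_{\mathrm{OLS}} = (B\gamma) \odot \hat\beta_{\mathrm{OLS}} = (B\gamma) \odot (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{15}$$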


which is the final form of the proposed shrinkage estimator. 
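We now need to find the optimal value of $\gamma$. To this end we will take
the vector derivative of Equation 13 with respect to $\gamma$:


\begin{align*}
&\frac{\partial}{\partial\gamma}\left[(\beta - \alpha\odot\beta)'S_t(\beta - \alpha\odot\beta) + \alpha'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})\alpha + \sigma^2\right] \\
&\quad= \frac{\partial}{\partial\gamma}(\beta - (B\gamma)\odot\beta)'S_t(\beta - (B\gamma)\odot\beta) + \frac{\partial}{\partial\gamma}(B\gamma)'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})(B\gamma) \\
&\quad= 2(\beta - (B\gamma)\odot\beta)'S_t\frac{\partial}{\partial\gamma}(\beta - (B\gamma)\odot\beta) + 2(B\gamma)'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})\frac{\partial}{\partial\gamma}(B\gamma) \\
&\quad= 2(\operatorname{diag}(\beta)B\gamma - \beta)'S_t\operatorname{diag}(\beta)B + 2(B\gamma)'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})B,
\end{align*}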


where the first equality follows by substituting Equation 14, and the remain- 
ing ones from basic rules of matrix calculus. Transposing and equating to 
zero we get 


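$$B'\operatorname{diag}(\beta)S_t(\operatorname{diag}(\beta)B\gamma - \beta) + B'(S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})B\gamma = 0_{q\times 1} \tag{16}$$


and finally

$$B'\left[\operatorname{diag}(\beta)S_t\operatorname{diag}(\beta) + (S_t \odot \operatorname{Var}\hat\beta_{\mathrm{OLS}})\right]B\gamma = B'\operatorname{diag}(\beta)S_t\beta. \tag{17}$$


By solving the last system of equations we obtain the optimal shrinkage
vector $\alpha$ or, more specifically, the vector of its unique components $\gamma$.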
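
As a quick sanity check of Equation 17, consider the special case (worked here for illustration) in which one shrinkage factor is shared by all coefficients, i.e. $B = 1_{(p+1)\times 1}$ and $\gamma$ is a scalar. Using $1'\operatorname{diag}(\beta) = \beta'$ and $1'(S_t \odot V)1 = \operatorname{Tr}(S_t V)$ for a symmetric matrix $V$, the system reduces to a single equation:

```latex
\gamma \left[ \beta' S_t \beta
      + \operatorname{Tr}\!\left( S_t \operatorname{Var}\hat\beta_{\mathrm{OLS}} \right) \right]
  = \beta' S_t \beta
\quad\Longrightarrow\quad
\gamma = \frac{\beta' S_t \beta}
              {\beta' S_t \beta
               + \operatorname{Tr}\!\left( S_t \operatorname{Var}\hat\beta_{\mathrm{OLS}} \right)} .
```

Since the trace term is nonnegative, this gives $\gamma \le 1$ whenever $\beta' S_t \beta > 0$: the larger the variance of the OLS estimator relative to the signal term $\beta' S_t \beta$, the stronger the shrinkage.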

Unfortunately, the estimate is not operational, since we do not know
the true regression coefficients $\beta$ or the matrix $S_t$. In order to obtain an
operational estimate $\hat\gamma$, we will take the approach used e.g. in [7, 10] and
replace the unknown parameters with estimates obtained from the training
sample. Such estimators are known in statistics as plug-in estimators.

The final proposed estimator $\hat\gamma$ of $\gamma$ is obtained by solving the system
of equations


$$B'\left[\operatorname{diag}(\hat\beta_{\mathrm{OLS}})\hat S_t\operatorname{diag}(\hat\beta_{\mathrm{OLS}}) + (\hat S_t \odot \widehat{\operatorname{Var}}\,\hat\beta_{\mathrm{OLS}})\right]B\hat\gamma = B'\operatorname{diag}(\hat\beta_{\mathrm{OLS}})\hat S_t\hat\beta_{\mathrm{OLS}}, \tag{18}$$

where all estimates will be based on a specific realization $X$, $y$ of the random
training data $\mathbf{X}$, $\mathbf{y}$ available to us. We will use the following estimators


$$\hat\beta_{\mathrm{OLS}} = (X'X)^{-1}X'y, \qquad \hat S_t = \frac{1}{n}X'X, \qquad \widehat{\operatorname{Var}}\,\hat\beta_{\mathrm{OLS}} = \frac{r'r}{n-p-1}(X'X)^{-1}, \tag{19}$$


where $r$ is the residual vector $r = y - X\hat\beta_{\mathrm{OLS}}$. Note that the variance of
the OLS estimator is computed assuming a fixed, nonrandom $X$. This
simplification is necessary since there is no general estimator of variance of
the OLS estimator under random predictors.
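
The plug-in procedure of Equations 18 and 19 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `shrink_ols`, the synthetic data and the seed are choices made for this example, and the variance estimate uses the usual residual degrees of freedom $n - p - 1$.

```python
import numpy as np

def shrink_ols(X, y, B):
    """Plug-in shrinkage estimate (B gamma_hat) ⊙ beta_OLS (Equations 15, 18, 19).

    X: n x (p+1) design matrix with a leading column of ones.
    B: (p+1) x q matrix mapping the unique factors gamma to alpha = B gamma.
    """
    n, k = X.shape                                     # k = p + 1
    XtX = X.T @ X
    beta_ols = np.linalg.solve(XtX, X.T @ y)           # OLS estimate
    r = y - X @ beta_ols                               # residual vector
    S_hat = XtX / n                                    # estimate of S_t
    var_hat = (r @ r) / (n - k) * np.linalg.inv(XtX)   # fixed-X variance estimate
    D = np.diag(beta_ols)
    lhs = B.T @ (D @ S_hat @ D + S_hat * var_hat) @ B  # q x q system matrix
    rhs = B.T @ D @ S_hat @ beta_ols
    gamma_hat = np.linalg.solve(lhs, rhs)
    return (B @ gamma_hat) * beta_ols, gamma_hat, beta_ols

# Synthetic example: one factor for the intercept, one for the other coefficients
rng = np.random.default_rng(0)
n, p = 30, 20
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
beta = np.concatenate([[10.0], np.where(np.arange(1, p + 1) % 2 == 1, 1.0, 0.5)])
y = X @ beta + rng.standard_normal(n)

B = np.zeros((p + 1, 2))
B[0, 0] = 1.0       # gamma_0 acts on the intercept only
B[1:, 1] = 1.0      # gamma_1 is shared by the remaining p coefficients
beta_shrunk, gamma_hat, beta_ols = shrink_ols(X, y, B)
```

Here `gamma_hat[0]` scales the estimated intercept and `gamma_hat[1]` scales all other coefficients; passing a different `B` changes how the factors are shared without touching the rest of the code.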




2.1 Concrete Shrinkage Estimators for the OLS 


In this section we present several shrinkage estimators for linear regression 
obtained by choosing a different matrix B in Equation 15. 


Single This case corresponds to the classical shrinkage estimators with a
single common shrinkage factor for all coefficients including the intercept.
Here we have


$$B_{\mathrm{single}} = 1_{(p+1)\times 1},$$


i.e. $B$ is just a column of ones, and $\gamma$ is a one-element vector.


Intercept This model is one of the main contributions of this paper. It
uses two separate shrinkage coefficients: one for the intercept term and
another one for all remaining coefficients. Here


$$B_{\mathrm{intercept}} = \begin{bmatrix} 1_{1\times 1} & 0_{1\times 1} \\ 0_{p\times 1} & 1_{p\times 1} \end{bmatrix}, \tag{20}$$


and $\gamma$ is a two-element vector.


Full For the sake of completeness we also investigated the possibility of
each model coefficient (including the intercept) having a separate shrinkage
factor. This corresponds to the $B$ matrix equal to the identity matrix


$$B_{\mathrm{full}} = I_{(p+1)\times(p+1)},$$


and the $\gamma$ vector having $p+1$ components.


No Intercept We also tested a strategy where the intercept is not shrunk
at all: its scaling factor is kept constant at 1. The remaining coefficients
share a common shrinkage factor. Unfortunately, this type of estimator
cannot be obtained by simply choosing an appropriate matrix $B$.

To achieve the desired result, we used the $B$ matrix of the Intercept
estimator (Equation 20) and replaced the first equation in the system given
in Equation 18 with $\gamma_0 = 1$.
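
The four variants can be expressed compactly in code. The sketch below is illustrative: the helper names `make_B` and `solve_no_intercept` are hypothetical, and the No Intercept helper assumes that the intercept factor is pinned at 1 (no shrinkage), as the description of that strategy states.

```python
import numpy as np

def make_B(p, kind):
    """Return the (p+1) x q matrix B for the named estimator."""
    if kind == "single":                 # one factor for every coefficient
        return np.ones((p + 1, 1))
    if kind == "intercept":              # gamma_0 for the intercept, gamma_1 for the rest
        B = np.zeros((p + 1, 2))
        B[0, 0] = 1.0
        B[1:, 1] = 1.0
        return B
    if kind == "full":                   # a separate factor per coefficient
        return np.eye(p + 1)
    raise ValueError(f"unknown estimator kind: {kind}")

def solve_no_intercept(lhs, rhs):
    """Solve the 2 x 2 Intercept system with its first equation replaced by gamma_0 = 1."""
    lhs = np.array(lhs, dtype=float)     # copies, so the caller's arrays are untouched
    rhs = np.array(rhs, dtype=float)
    lhs[0, :] = [1.0, 0.0]               # first equation becomes 1 * gamma_0 + 0 * gamma_1 = 1
    rhs[0] = 1.0
    return np.linalg.solve(lhs, rhs)

# alpha = B gamma for p = 3: the intercept gets 0.9, the other coefficients 0.5
alpha = make_B(3, "intercept") @ np.array([0.9, 0.5])

# Toy 2 x 2 system, for illustration only
gamma_no_int = solve_no_intercept([[2.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In `solve_no_intercept`, the second equation still couples $\gamma_1$ to the fixed $\gamma_0 = 1$, so the common factor for the remaining coefficients is chosen with the unshrunk intercept taken into account.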




3  Shrinked OLS Regression. 
Experimental Evaluation 


In this section we present an experimental evaluation of the proposed shrink- 
age estimators. In order to be able to fully control the experiment and 
achieve credible results we resorted to the use of simulated data. 

The simulation procedure was conducted as follows. First we chose the
true coefficient vector $\beta$ based on which the data were simulated. We set
$\beta_i = 1$ for odd $i > 0$ and $\beta_i = 0.5$ for even $i > 0$. The experiments were
conducted for five different values of the intercept term $\beta_0$: 0.01, 0.1, 1, 10,
and 100. This way, intercept values of magnitude smaller and larger than
the remaining coefficients were tested.

We generated a random matrix $X$ with $n = 30$ rows and $p = 20$ columns
from a standard normal distribution with all variables uncorrelated. A small
value of $n$ was chosen to illustrate the estimators’ behavior in small-sample
scenarios, where shrinkage estimators are most useful. Then, the response
vector $y$ was generated by adding standard normal random noise ($\sigma = 1$) to
the vector $X\beta$.

Model coefficients were then estimated using the standard OLS estimator
and the shrinkage estimators described above. The predictive mean squared
error was computed on a test set with 10000 records generated using the
same procedure.

The experiment was repeated 100000 times for each value of the intercept
term, and the results were averaged. Table 1 presents the outcomes.
The OLS estimator is the standard least squares estimator without shrinkage.
Notice that the errors of the OLS and No Intercept estimators do not
depend on the true intercept term used.
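
A scaled-down version of this simulation protocol might look as follows. This is a sketch under several assumptions that are not taken from the paper: the design matrix consists of a column of ones plus $p$ standard normal columns, only 200 repetitions are run, and the test-set MSE is replaced by its closed form $\|\hat\beta - \beta\|^2 + \sigma^2$, which holds here because $S_t = I$ for uncorrelated standard normal predictors. Absolute numbers therefore need not match Table 1.

```python
import numpy as np

def shrink_ols(X, y, B):
    """Return the plug-in shrunk estimate (Equation 18) and the plain OLS estimate."""
    n, k = X.shape
    XtX = X.T @ X
    b = np.linalg.solve(XtX, X.T @ y)
    r = y - X @ b
    S = XtX / n
    V = (r @ r) / (n - k) * np.linalg.inv(XtX)
    D = np.diag(b)
    gamma = np.linalg.solve(B.T @ (D @ S @ D + S * V) @ B, B.T @ D @ S @ b)
    return (B @ gamma) * b, b

def simulate(beta0, n=30, p=20, n_runs=200, seed=0):
    """Average predictive MSE of OLS and of the Intercept shrinkage estimator."""
    rng = np.random.default_rng(seed)
    # beta_i = 1 for odd i > 0 and 0.5 for even i > 0, intercept beta0 first
    beta = np.concatenate([[beta0], np.where(np.arange(1, p + 1) % 2 == 1, 1.0, 0.5)])
    B = np.zeros((p + 1, 2))
    B[0, 0] = 1.0
    B[1:, 1] = 1.0
    mse = {"OLS": [], "Intercept": []}
    for _ in range(n_runs):
        X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
        y = X @ beta + rng.standard_normal(n)
        b_shrunk, b_ols = shrink_ols(X, y, B)
        # Closed-form predictive MSE for S_t = I and sigma = 1
        mse["OLS"].append(np.sum((b_ols - beta) ** 2) + 1.0)
        mse["Intercept"].append(np.sum((b_shrunk - beta) ** 2) + 1.0)
    return {name: float(np.mean(values)) for name, values in mse.items()}

results = simulate(beta0=10.0)
```

Calling `simulate` for each intercept value in turn, with a larger `n_runs`, gives a rough analogue of the OLS and Intercept rows of the table.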

It can be seen that the proposed Intercept shrinkage estimator, which uses
a separate shrinkage factor for the intercept, achieves the lowest error, except
for the case when the true intercept is of the same order of magnitude as
the remaining coefficients ($\beta_0 = 1$), where the Single estimator is marginally
better (3.0536 vs. 3.0573). The differences are small but consistent. The
proposed estimator always outperforms the No Intercept estimator, which
does not shrink the intercept. Moreover, all shrinkage estimators except Full
outperform the original OLS estimator in all cases. The Full estimator,
which shrinks each coefficient independently, is a clear loser.

Estimator        test set MSE   std. deviation   std. error

$\beta_0 = 0.01$
OLS              3.2163         1.4245           0.0045
Intercept        3.0254         1.2903           0.0041
No intercept     3.0597         1.3083           0.0041
Single           3.0417         1.2938           0.0041
Full             3.2171         1.2316           0.0039

$\beta_0 = 0.1$
OLS              3.2163         1.4244           0.0045
Intercept        3.0271         1.2902           0.0041
No intercept     3.0597         1.3083           0.0041
Single           3.0421         1.2939           0.0041
Full             3.2171         1.2316           0.0039

$\beta_0 = 1$
OLS              3.2163         1.4244           0.0045
Intercept        3.0573         1.3003           0.0041
No intercept     3.0597         1.3083           0.0041
Single           3.0536         1.3022           0.0041
Full             3.2863         1.2597           0.0040

$\beta_0 = 10$
OLS              3.2163         1.4244           0.0045
Intercept        3.0496         1.2979           0.0041
No intercept     3.0597         1.3083           0.0041
Single           3.1956         1.4086           0.0045
Full             3.2687         1.2566           0.0040

$\beta_0 = 100$
OLS              3.2163         1.4244           0.0045
Intercept        3.0495         1.2977           0.0041
No intercept     3.0597         1.3083           0.0041
Single           3.2161         1.4243           0.0045
Full             3.2683         1.2561           0.0040


Table 1: Mean Squared Error of various shrinkage estimators for linear
regression for different values of the true intercept term.


The third column provides the standard deviation of the estimated
MSE and the fourth its standard error (that is, the standard deviation divided
by the square root of the number of experiments). Values in the last column
provide the precision of the mean MSE averaged over all simulations. It
can be seen that, thanks to a relatively large number of simulations, the
differences between estimators are significant.
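
The relationship between the standard deviation and standard error columns can be reproduced with a few lines of arithmetic. The snippet below uses values copied from Table 1; the z-score treats the two means as independent, which is an assumption of this sketch (the text does not state whether the estimators were evaluated on the same simulated datasets, in which case a paired comparison would be more appropriate).

```python
import math

n_runs = 100_000

# Standard error of the mean MSE for OLS: standard deviation / sqrt(number of runs)
std_err_ols = 1.4245 / math.sqrt(n_runs)

# OLS vs. Intercept at beta_0 = 0.01
diff = 3.2163 - 3.0254
se_diff = math.sqrt(0.0045 ** 2 + 0.0041 ** 2)   # assumes independent means
z = diff / se_diff
```

Rounded to four decimals, `std_err_ols` gives the 0.0045 reported in the table, and the resulting `z` of about 31 shows that the gap between OLS and the Intercept estimator is far larger than the simulation noise.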


[Figure 1 plots: left panel $p = 20$, $n$ from 40 to 100, MSE ratio axis from 0.94 to 1.00; right panel $p = 100$, $n$ from 100 to 500, MSE ratio axis from 0.975 to 0.995.]


Figure 1: Ratio of mean squared errors of the Intercept shrinkage estimator
and ordinary least squares estimator for $p = 20$ variables (left) and $p = 100$
variables (right) for growing number of training records ($n$)


To evaluate the gains from using the estimators for different data sizes
we compared the Intercept shrinkage estimator, which performed best in
Table 1, with the standard OLS estimator. The value of $\beta_0 = 1$ was used
for all charts. The results are shown in Figure 1 for the cases of $p = 20$ and
$p = 100$ variables in the model and growing numbers of data records.

It can be seen that the gains in expected mean squared error are 
relatively small but consistent. Shrinkage estimators are most useful for 
small datasets. In the following sections we will demonstrate that larger 
gains are possible for uplift models. 


4 Shrinkage Estimators for Uplift Regression 


Let us now proceed to the case of uplift modeling. In this problem we have
two training sets: treatment and control. The quantities related to the
treatment group will be denoted with superscript $T$ and to the control group
with superscript $C$. Quantities related to the uplift (i.e. the conditional
treatment effect) will be denoted with superscript $U$. For example, $\beta^C$ will
denote the true coefficient vector of the linear response in control cases and
$\beta^U$ the true coefficient vector of the linear strength of effect of the action.
Let us now state model assumptions, analogously to Equation 1:


\begin{align*}
\mathbf{y}^C &= \mathbf{X}^C\beta^C + \boldsymbol{\varepsilon}^C, \\
\mathbf{y}^T &= \mathbf{X}^T\beta^C + \mathbf{X}^T\beta^U + \boldsymbol{\varepsilon}^T = \mathbf{X}^T\beta^T + \boldsymbol{\varepsilon}^T.
\end{align*}


Notice that we assume linear response in the control group (with true
coefficient vector $\beta^C$) and a linear conditional effect of the action (with true
coefficient vector $\beta^U$). As a result, the response in the treatment group is
also linear with coefficients $\beta^T = \beta^C + \beta^U$. The parameter of interest, which
we want to estimate, is $\beta^U$.

We will assume randomized assignment of cases to the treatment group 
to guarantee causal nature of discovered relationships [8, 9]. 

As the base uplift estimator to which shrinkage will be applied, we use 
the double model approach [9] (also known as T-learner), which is also the 
base model used to obtain shrinkage uplift estimators in [10]. The model is 
simply the difference of two OLS regression estimators built independently 
on the treatment and control datasets: 


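$$\hat\beta^U_d = \hat\beta^T_{\mathrm{OLS}} - \hat\beta^C_{\mathrm{OLS}} = (\mathbf{X}^{T\prime}\mathbf{X}^T)^{-1}\mathbf{X}^{T\prime}\mathbf{y}^T - (\mathbf{X}^{C\prime}\mathbf{X}^C)^{-1}\mathbf{X}^{C\prime}\mathbf{y}^C, \tag{21}$$


where the subscript $d$ stands for ‘double’ and $\hat\beta^T_{\mathrm{OLS}}$, $\hat\beta^C_{\mathrm{OLS}}$ are OLS estimators
of, respectively, treatment and control response coefficient vectors. Notice
that since we assume that the treatment assignment is random, this simple
model is able to obtain causal predictions.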
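
The double model of Equation 21 can be sketched as two independent least-squares fits. The function names `ols`, `double_model` and `predict_uplift`, as well as the synthetic data, are assumptions made for this illustration.

```python
import numpy as np

def ols(X, y):
    """OLS fit; X must contain the leading column of ones."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

def double_model(X_T, y_T, X_C, y_C):
    """Uplift coefficients: difference of the treatment and control OLS fits."""
    return ols(X_T, y_T) - ols(X_C, y_C)

def predict_uplift(beta_u, x):
    """Predicted effect of the action for a feature vector x (without the leading 1)."""
    return float(np.concatenate([[1.0], x]) @ beta_u)

# Synthetic randomized experiment (sizes and coefficients chosen for illustration)
rng = np.random.default_rng(0)
n, p = 500, 3
beta_C = np.array([0.1, 1.0, 0.5, 1.0])      # control response coefficients
beta_U = np.array([1.0, 0.1, 0.1, 0.1])      # true uplift coefficients
X_T = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
X_C = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y_T = X_T @ (beta_C + beta_U) + rng.standard_normal(n)
y_C = X_C @ beta_C + rng.standard_normal(n)

beta_u_hat = double_model(X_T, y_T, X_C, y_C)
```

Because the estimate is a difference of two noisy fits, its variance is the sum of the two OLS variances, which is one reason why shrinkage can pay off more in the uplift setting.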

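In order to repeat the derivation given in Section 2 we will now need two
shrinkage coefficient vectors $\alpha^C$ and $\alpha^T$, and two vectors of unique shrinkage
coefficients $\gamma^C$ and $\gamma^T$ satisfying


$$\alpha^T = B\gamma^T, \qquad \alpha^C = B\gamma^C,$$


where the matrix $B$ is assumed to be identical for both groups. The final
form of the proposed uplift shrinkage estimator is


$$\hat\beta^U_{\alpha^T,\alpha^C} = \alpha^T \odot \hat\beta^T_{\mathrm{OLS}} - \alpha^C \odot \hat\beta^C_{\mathrm{OLS}} = (B\gamma^T) \odot \hat\beta^T_{\mathrm{OLS}} - (B\gamma^C) \odot \hat\beta^C_{\mathrm{OLS}}, \tag{22}$$

which has a similar form to the estimator used in [10], except that we will
now allow different shrinkage factors for different coefficients through the
use of the matrix $B$. It is easy to see that the expectation of the estimator
is equal to


$$\mathrm{E}\,\hat\beta^U_{\alpha^T,\alpha^C} = \mathrm{E}_{\mathbf{X}^C}\mathrm{E}_{\mathbf{y}^C}\mathrm{E}_{\mathbf{X}^T}\mathrm{E}_{\mathbf{y}^T}\,\hat\beta^U_{\alpha^T,\alpha^C} = \alpha^T \odot \beta^T - \alpha^C \odot \beta^C.$$


Let us now derive the expression for the MSE of the estimator, analogously
to Equations 7-13. We have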


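\begin{align}
\operatorname{MSE}(\hat\beta^U_{\alpha^T,\alpha^C})
&= \mathrm{E}_{\mathbf{x}_t}\mathrm{E}_{\mathbf{X}^C}\mathrm{E}_{\mathbf{y}^C}\mathrm{E}_{\mathbf{X}^T}\mathrm{E}_{\mathbf{y}^T}\,(\mathbf{x}_t'\beta^U - \mathbf{x}_t'\hat\beta^U_{\alpha^T,\alpha^C})^2 \tag{23} \\
&= \mathrm{E}_{\mathbf{X}^C}\mathrm{E}_{\mathbf{y}^C}\mathrm{E}_{\mathbf{X}^T}\mathrm{E}_{\mathbf{y}^T}\,(\beta^U - \hat\beta^U_{\alpha^T,\alpha^C})'S_t(\beta^U - \hat\beta^U_{\alpha^T,\alpha^C}) \tag{24} \\
&= (\beta^U - \mathrm{E}\,\hat\beta^U_{\alpha^T,\alpha^C})'S_t(\beta^U - \mathrm{E}\,\hat\beta^U_{\alpha^T,\alpha^C}) + \operatorname{Tr}(S_t\operatorname{Var}(\hat\beta^U_{\alpha^T,\alpha^C})) \tag{25} \\
&= (\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C)'S_t(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\quad + \operatorname{Tr}(S_t\operatorname{Var}(\alpha^T\odot\hat\beta^T_{\mathrm{OLS}})) + \operatorname{Tr}(S_t\operatorname{Var}(\alpha^C\odot\hat\beta^C_{\mathrm{OLS}})) \tag{26} \\
&= (\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C)'S_t(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\quad + \operatorname{Tr}(S_t\operatorname{diag}(\alpha^T)\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}})\operatorname{diag}(\alpha^T)) \notag \\
&\quad + \operatorname{Tr}(S_t\operatorname{diag}(\alpha^C)\operatorname{Var}(\hat\beta^C_{\mathrm{OLS}})\operatorname{diag}(\alpha^C)) \tag{27} \\
&= (\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C)'S_t(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\quad + (\alpha^T)'(S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}}))\alpha^T + (\alpha^C)'(S_t\odot\operatorname{Var}(\hat\beta^C_{\mathrm{OLS}}))\alpha^C, \notag
\end{align}


where Equation 26 follows from the independence of treatment and control
OLS estimators. We note that the MSE of the treatment effect is often
called PEHE in the literature [1].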

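Let us now take the derivative of the above expression with respect
to $\gamma^T$ (the derivation for $\gamma^C$ is analogous). We have


\begin{align}
&\frac{\partial}{\partial\gamma^T}\Big[(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C)'S_t(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\qquad + (\alpha^T)'(S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}}))\alpha^T + (\alpha^C)'(S_t\odot\operatorname{Var}(\hat\beta^C_{\mathrm{OLS}}))\alpha^C\Big] \tag{28} \\
&= \frac{\partial}{\partial\gamma^T}(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C)'S_t(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\qquad + \frac{\partial}{\partial\gamma^T}(\alpha^T)'(S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}}))\alpha^T \tag{29} \\
&= 2(\beta^U - (B\gamma^T)\odot\beta^T + (B\gamma^C)\odot\beta^C)'S_t\frac{\partial}{\partial\gamma^T}(\beta^U - \alpha^T\odot\beta^T + \alpha^C\odot\beta^C) \notag \\
&\qquad + 2(B\gamma^T)'(S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}}))\frac{\partial}{\partial\gamma^T}B\gamma^T \tag{30} \\
&= 2(\operatorname{diag}(\beta^T)B\gamma^T - \operatorname{diag}(\beta^C)B\gamma^C - \beta^U)'S_t\operatorname{diag}(\beta^T)B \notag \\
&\qquad + 2(B\gamma^T)'(S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}}))B. \tag{31}
\end{align}


Transposing and equating to zero we get


\begin{align}
&B'\left[\operatorname{diag}(\beta^T)S_t\operatorname{diag}(\beta^T) + S_t\odot\operatorname{Var}(\hat\beta^T_{\mathrm{OLS}})\right]B\gamma^T \notag \\
&\qquad - B'\operatorname{diag}(\beta^T)S_t\operatorname{diag}(\beta^C)B\gamma^C = B'\operatorname{diag}(\beta^T)S_t\beta^U. \tag{32}
\end{align}


By an analogous argument, taking the derivative with respect to $\gamma^C$ yields

\begin{align}
&-B'\operatorname{diag}(\beta^C)S_t\operatorname{diag}(\beta^T)B\gamma^T + B'\big[\operatorname{diag}(\beta^C)S_t\operatorname{diag}(\beta^C) \notag \\
&\qquad + S_t\odot\operatorname{Var}(\hat\beta^C_{\mathrm{OLS}})\big]B\gamma^C = -B'\operatorname{diag}(\beta^C)S_t\beta^U. \tag{33}
\end{align}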


Together, Equations 32 and 33 make up a system of equations through which
we can find the shrinkage coefficient vectors $\gamma^C$ and $\gamma^T$.

Unfortunately, as was the case with shrinking the Ordinary Least
Squares estimator in Section 2, the system depends on several quantities
which are unknown to us during estimation. We take the same strategy as
we did in Section 2, i.e. we substitute the following training set estimators of
those quantities:


$$\hat\beta^T_{\mathrm{OLS}} = (X^{T\prime}X^T)^{-1}X^{T\prime}y^T, \qquad \hat\beta^C_{\mathrm{OLS}} = (X^{C\prime}X^C)^{-1}X^{C\prime}y^C, \qquad \hat S_t = \frac{1}{n}X'X, \tag{34}$$


where $X^T$, $y^T$ is the treatment training sample, $X^C$, $y^C$ is the control training
sample, $X$ is a concatenation of $X^T$ and $X^C$, and $n$ is the total number of
records in both training sets. The formulas for treatment and control OLS
variances are analogous to Equation 19 and the unknown $\beta^U$ is replaced
with $\hat\beta^U_d$ given in Equation 21.


Definition 1 The estimator defined jointly by Equations 22, 32, 33 and 34
will be called the separately shrinked uplift estimator.
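
A possible implementation of Definition 1 stacks Equations 32 and 33 into one $2q \times 2q$ linear system in $(\gamma^T, \gamma^C)$. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name `separately_shrunk_uplift` and the synthetic data are hypothetical, and the group-wise variance estimates use $n - p - 1$ residual degrees of freedom.

```python
import numpy as np

def separately_shrunk_uplift(X_T, y_T, X_C, y_C, B):
    """Plug-in solution of Equations 32-33; returns the shrunk uplift coefficients."""
    def fit(X, y):
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y                        # OLS estimate for one group
        r = y - X @ b
        return b, (r @ r) / (n - k) * XtX_inv        # estimate and its variance
    b_T, V_T = fit(X_T, y_T)
    b_C, V_C = fit(X_C, y_C)
    b_U = b_T - b_C                                  # double model replaces unknown beta^U
    X = np.vstack([X_T, X_C])
    S = X.T @ X / X.shape[0]                         # estimate of S_t on pooled data
    D_T, D_C = np.diag(b_T), np.diag(b_C)
    lhs = np.block([
        [B.T @ (D_T @ S @ D_T + S * V_T) @ B, -B.T @ D_T @ S @ D_C @ B],
        [-B.T @ D_C @ S @ D_T @ B, B.T @ (D_C @ S @ D_C + S * V_C) @ B],
    ])
    rhs = np.concatenate([B.T @ D_T @ S @ b_U, -B.T @ D_C @ S @ b_U])
    gamma = np.linalg.solve(lhs, rhs)
    q = B.shape[1]
    gamma_T, gamma_C = gamma[:q], gamma[q:]
    return (B @ gamma_T) * b_T - (B @ gamma_C) * b_C, gamma_T, gamma_C

# Synthetic example with separate intercept shrinkage (values chosen for illustration)
rng = np.random.default_rng(0)
n, p = 30, 20
beta_C = np.concatenate([[0.1], np.where(np.arange(1, p + 1) % 2 == 1, 1.0, 0.5)])
beta_U = np.concatenate([[10.0], np.full(p, 0.1)])
X_T = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
X_C = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y_T = X_T @ (beta_C + beta_U) + rng.standard_normal(n)
y_C = X_C @ beta_C + rng.standard_normal(n)

B = np.zeros((p + 1, 2))
B[0, 0] = 1.0
B[1:, 1] = 1.0
beta_u_shrunk, gamma_T, gamma_C = separately_shrunk_uplift(X_T, y_T, X_C, y_C, B)
```

The off-diagonal blocks of `lhs` couple the treatment and control factors, so the two groups cannot be shrunk by solving two independent OLS-style systems.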


When all coefficients are shrunk identically the estimator becomes the uplift 
MSE-minimizing estimator from [10]. 


4.1 An Alternative Definition of Uplift Shrinkage Estimators 


Notice that the double model given in Equation 21 yields a single parameter
vector $\hat\beta^U_d$. If we were able to estimate the variance of this vector, we could
directly apply to it the shrinkage estimators developed for OLS regression
in Section 2. Since the least squares estimates on treatment and control
training sets are independent, the variance of $\hat\beta^U_d$ can be estimated simply as


$$\widehat{\operatorname{Var}}\,\hat\beta^U_d = \frac{r^{T\prime}r^T}{n^T-p-1}(X^{T\prime}X^T)^{-1} + \frac{r^{C\prime}r^C}{n^C-p-1}(X^{C\prime}X^C)^{-1}, \tag{35}$$


where $n^T$ and $n^C$ are the numbers of, respectively, treatment and control
training samples, and $r^T$, $r^C$ the treatment and control residual vectors. Notice
that the expression above is the sum of variances of treatment and control
least squares estimators given in Equation 19. We will substitute this
expression in place of the variance of the OLS estimator in Equation 18 to
obtain the following shrinked uplift estimator:


Definition 2 An estimator
$$\hat\beta^U_{\hat\gamma} = (B\hat\gamma) \odot \hat\beta^U_d, \tag{36}$$
where $\hat\gamma$ is obtained by solving the system of equations
$$B'\left[\operatorname{diag}(\hat\beta^U_d)\hat S_t\operatorname{diag}(\hat\beta^U_d) + (\hat S_t \odot \widehat{\operatorname{Var}}\,\hat\beta^U_d)\right]B\hat\gamma = B'\operatorname{diag}(\hat\beta^U_d)\hat S_t\hat\beta^U_d, \tag{37}$$


with $\hat S_t$ given in Equation 34 and $\widehat{\operatorname{Var}}\,\hat\beta^U_d$ in Equation 35, will be called the
jointly shrinked uplift estimator.
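
Definition 2 reuses the single-model machinery of Section 2 with the summed variance of Equation 35. The following sketch illustrates this under assumptions: the function name `jointly_shrunk_uplift` and the synthetic data are hypothetical, and $n^T - p - 1$ and $n^C - p - 1$ are used as residual degrees of freedom.

```python
import numpy as np

def jointly_shrunk_uplift(X_T, y_T, X_C, y_C, B):
    """Shrink the double-model estimate with one system, as in Equations 35-37."""
    def fit(X, y):
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        r = y - X @ b
        return b, (r @ r) / (n - k) * XtX_inv
    b_T, V_T = fit(X_T, y_T)
    b_C, V_C = fit(X_C, y_C)
    b_U = b_T - b_C                    # double model estimate
    V_U = V_T + V_C                    # variances add for independent estimates
    X = np.vstack([X_T, X_C])
    S = X.T @ X / X.shape[0]           # estimate of S_t on pooled data
    D = np.diag(b_U)
    gamma = np.linalg.solve(B.T @ (D @ S @ D + S * V_U) @ B, B.T @ D @ S @ b_U)
    return (B @ gamma) * b_U, gamma, b_U

# Synthetic example (values chosen for illustration)
rng = np.random.default_rng(0)
n, p = 30, 20
beta_C = np.concatenate([[0.1], np.where(np.arange(1, p + 1) % 2 == 1, 1.0, 0.5)])
beta_U = np.concatenate([[10.0], np.full(p, 0.1)])
X_T = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
X_C = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y_T = X_T @ (beta_C + beta_U) + rng.standard_normal(n)
y_C = X_C @ beta_C + rng.standard_normal(n)

B = np.zeros((p + 1, 2))
B[0, 0] = 1.0
B[1:, 1] = 1.0
beta_u_joint, gamma_hat, beta_u_double = jointly_shrunk_uplift(X_T, y_T, X_C, y_C, B)
```

Unlike the separately shrinked estimator, this variant can only rescale the difference of the two fits; it cannot shrink the treatment and control coefficients by different amounts.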


5 Shrinked Uplift Regression. 
Experimental Evaluation 


In this section we present an experimental evaluation of uplift shrinkage
estimators on simulated data. The simulation protocol is similar to that
used in Section 3, except that we now have two training sets: treatment
and control. They both have 30 data records and 20 variables (excluding the
intercept). The control response coefficients $\beta^C$ are the same as the regression
coefficients in Section 3 with the control group intercept $\beta^C_0 = 0.1$. The
coefficient vector $\beta^U$ of the linear conditional effect (the quantity of interest)
has all coefficients equal to 0.1, except for the intercept, $\beta^U_0$, for which four
different values of 0.01, 0.1, 1, and 10 were used.

For each value of the intercept, the simulation was repeated 100000
times. The results are shown in Table 2. Double denotes the double
estimator given in Equation 21, which does not use shrinkage.

First, it should be noted that several shrinkage estimators achieved
a dramatic reduction in MSE over the original double regression
model. In some cases MSE was reduced by a factor of more than ten
(e.g. from 4.4384 to 0.3647 for $\beta^U_0 = 0.01$).

It can also be seen that for small values of the uplift intercept term $\beta^U_0$,
using a single shrinkage factor for all coefficients yielded the best model.
The proposed separately shrinked Intercept method was slightly worse.
The picture changed for larger values of the intercept, where the proposed
method was a clear winner. The Single strategy actually achieved the worst
result for $\beta^U_0 = 10$.


             Separately shrinked estimator     Jointly shrinked estimator
             (Definition 1)                    (Definition 2)
Estimator    test MSE   s. dev.   s. err.      test MSE   s. dev.   s. err.

$\beta^U_0 = 0.01$
Double       4.4384     2.2681    0.0072
Intercept    0.4460     0.5087    0.0016       1.709      1.5438    0.0049
Single       0.3647     0.4712    0.0015       1.6875     1.5412    0.0049
Full         3.4116     1.6822    0.0053       2.1944     1.6603    0.0053

$\beta^U_0 = 0.1$
Double       4.4393     2.2478    0.0071
Intercept    0.4479     0.5016    0.0016       1.7102     1.5644    0.0049
Single       0.3694     0.4642    0.0015       1.6904     1.5632    0.0049
Full         3.4161     1.6633    0.0053       2.1962     1.6808    0.0053

$\beta^U_0 = 1$
Double       4.4326     2.2539    0.0071
Intercept    0.499      0.5175    0.0016       1.8654     1.5783    0.005
Single       1.1209     0.5666    0.0018       2.1796     1.5641    0.0049
Full         3.5075     1.6981    0.0054       2.3714     1.701     0.0054

$\beta^U_0 = 10$
Double       4.4292     2.2382    0.0071
Intercept    0.4946     0.5127    0.0016       1.825      1.5843    0.005
Single       7.4138     2.9874    0.0094       5.0334     2.5958    0.0082
Full         3.4993     1.6825    0.0053       2.3296     1.7155    0.0054


Table 2: Mean Squared Error of various shrinkage estimators for uplift
regression for different values of the true intercept term $\beta^U_0$.


Also, the jointly shrinked uplift estimators performed significantly worse 
than separately shrinked estimators. An exception was the Full case, which, 
however, was still not competitive against the Intercept method. 

Figure 2 shows the ratio of the MSEs of the proposed separately
shrinked Intercept method and the simple double model for $p = 20$ and
$p = 100$ variables and growing number of data records. The value of $\beta^U_0 = 10$
was used. Overall, it can be seen that for uplift regression potential gains
from using shrinkage estimators can be much larger than for classical linear
regression. Moreover, the gains remain very high over a broad range of
parameter values.


[Figure 2 plots: left panel $p = 20$, $n^T$, $n^C$ from 40 to 100, MSE ratio axis from 0.110 to 0.125; right panel $p = 100$, $n^T$, $n^C$ from 100 to 500, MSE ratio axis from 0.05 to 0.15.]


Figure 2: Ratio of mean squared errors of the separately shrinked Intercept
estimator and the double uplift estimator for $p = 20$ variables (left) and
$p = 100$ variables (right) for growing number of training records ($n^T$, $n^C$)


6 Conclusions 


The paper investigated various shrinkage estimators for ordinary linear
regression and for uplift regression, whose aim is the estimation of the causal
effect of some action. The novelty of the proposed estimators lies in how
shrinkage is applied to the intercept term, a topic ignored in the current
literature. We have demonstrated the benefits of using a separate shrinkage
factor for the intercept term. Our proposed shrinkage estimators achieved 
consistent improvements in predictive mean squared error for both ordinary 
and uplift regression, with significant gains in the latter case. A possible 
topic of future research is extending the results to other types of models, 
such as those applicable to survival data used frequently in medicine [6]. 


References 


[1] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous
causal effects. Proceedings of the National Academy of Sciences,
113(27):7353-7360, 2016. doi:10.1073/pnas.1510489113.


[2] Christian Heumann, Thomas Nittner, Sandro Scheid, C. Radhakrishna
Rao, and Helge Toutenburg. Linear Models: Least Squares and Alternatives.
Springer New York, 2013. doi:10.1007/978-1-4899-0024-1.


[3] Paul W. Holland. Statistics and causal inference. Journal of the
American Statistical Association, 81(396):945-960, 1986.
doi:10.2307/2289064.


[4] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis.
Cambridge University Press, 1994. doi:10.1017/CBO9780511840371.


[5] Willard James and Charles Stein. Estimation with quadratic loss. In
Jerzy Neyman, editor, Proceedings of the Fourth Berkeley Symposium
on Mathematical Statistics and Probability, Volume 1: Contributions to
the Theory of Statistics, pages 361-379, 1961.


[6] Szymon Jaroszewicz and Piotr Rzepakowski. Uplift modeling with
survival data. In ACM SIGKDD Workshop on Health Informatics
(HI-KDD’14), 2014.


[7] Akio Namba and Kazuhiro Ohtani. MSE performance of the weighted
average estimators consisting of shrinkage estimators. Communications
in Statistics - Theory and Methods, 47(5):1204-1214, 2018.
doi:10.1080/03610926.2017.1316860.


[8] Nicholas J. Radcliffe and Patrick D. Surry. Real-world uplift modelling
with significance-based uplift trees. Portrait Technical Report TR-2011-1,
Stochastic Solutions, 2011.


[9] Krzysztof Rudas and Szymon Jaroszewicz. Linear regression for uplift
modeling. Data Mining and Knowledge Discovery, 32(5):1275-1305,
Sep 2018.


[10] Krzysztof Rudas and Szymon Jaroszewicz. Shrinkage estimators for
uplift regression. In Ulf Brefeld, Elisa Fromont, Andreas Hotho, Arno
Knobbe, Marloes Maathuis, and Céline Robardet, editors, Proceedings of
the European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML/PKDD’19), pages 607-623.
Springer-Verlag, 2019. doi:10.1007/978-3-030-46150-8_36.


[11] Piotr Rzepakowski and Szymon Jaroszewicz. Decision trees for uplift
modeling with single and multiple treatments. Knowledge and Information
Systems, 32(2):303-327, 2012. doi:10.1007/s10115-011-0434-0.


© Scientific Annals of Computer Science 2023


